In this notebook, we do a brief exploration of our HR analytics data (found on Kaggle, which you can check for more info on the dataset) and try to discern which factors matter the most in determining why our personnel leave. The notebook will primarily be divided into two sections -- data analysis and machine learning.
In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
First, let's read in and get an overview of the data we'll be working with.
In [2]:
hr_data = pd.read_csv('../input/HR_comma_sep.csv')
hr_data.head()
Out[2]:
In [3]:
hr_data.describe()
Out[3]:
In [4]:
hr_data.info()
Conveniently, there is no missing data. Because the "sales" and "salary" columns are non-numeric, we check their unique levels and then dummy code them.
In [5]:
print('Departments: ', ', '.join(hr_data['sales'].unique()))
print('Salary levels: ', ', '.join(hr_data['salary'].unique()))
In [6]:
hr_data.rename(columns={'sales':'department'}, inplace=True)
# Pass the columns by keyword; a positional list would be read as the prefix argument
hr_data_new = pd.get_dummies(hr_data, columns=['department', 'salary'], drop_first=True)
In [7]:
hr_data_new.head()
Out[7]:
Observe that "IT" and "high" are the baseline levels for the assigned department and salary level, respectively. Also note that we saved the data with dummified variables as another dataframe in case we need to access the string values, such as for a cross-tabulation table.
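To see why those two levels end up as the baselines, here is a minimal sketch on a made-up toy frame (the rows are invented; only the column names mirror the notebook). It assumes that `drop_first=True` drops the first level of each column in sorted order, which is why the uppercase "IT" sorts ahead of the lowercase department names.

```python
import pandas as pd

# Toy frame with invented rows; only the column names mirror the notebook.
toy = pd.DataFrame({
    'department': ['sales', 'IT', 'hr', 'IT'],
    'salary': ['low', 'high', 'medium', 'low'],
    'left': [1, 0, 1, 0],
})

# drop_first=True removes the first level (in sorted order) of each encoded
# column, so that level becomes the implicit baseline.
toy_dummies = pd.get_dummies(toy, columns=['department', 'salary'], drop_first=True)

baseline_department = sorted(toy['department'].unique())[0]
baseline_salary = sorted(toy['salary'].unique())[0]

print('Baseline department:', baseline_department)
print('Baseline salary:', baseline_salary)
print(list(toy_dummies.columns))
```

A row with zeros in every `department_*` column therefore belongs to the baseline department, and likewise for salary.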
In [8]:
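# Correlation matrix over the numeric columns only; recent pandas versions no longer drop string columns silently
sns.heatmap(hr_data.select_dtypes('number').corr(), annot=True)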
Out[8]:
The matrix above shows that, generally speaking, the variables are only weakly correlated with one another. This is good because it means we likely won't have issues with multicollinearity later.
It is notable, though perhaps unsurprising, that our employees' satisfaction level is the variable that is most highly correlated with them leaving.
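Rather than reading the ranking off the heatmap, the correlations with the target can be pulled out and sorted by absolute value. The sketch below uses synthetic data (the `noise` column and the rule that generates `left` are invented for illustration), so the numbers it prints say nothing about the real dataset.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 500

# Synthetic data: lower satisfaction makes leaving more likely (invented rule).
satisfaction = rng.uniform(0, 1, n)
left = (satisfaction + rng.normal(0, 0.3, n) < 0.4).astype(int)

toy = pd.DataFrame({
    'satisfaction_level': satisfaction,
    'noise': rng.normal(size=n),                          # unrelated numeric column
    'salary': rng.choice(['low', 'medium', 'high'], n),   # string column, excluded below
    'left': left,
})

# Correlation of every numeric column with the target, excluding the target itself.
corr_with_left = toy.select_dtypes('number').corr()['left'].drop('left')

# Order by strength of the relationship, keeping the sign for interpretation.
order = corr_with_left.abs().sort_values(ascending=False).index
ranked = corr_with_left.reindex(order)
print(ranked)
```

Sorting on the absolute value matters here because a strong negative correlation, like the one built into this toy data, is just as informative as a strong positive one.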
In [9]:
hr_data_new.columns
Out[9]:
Let's first check if there are any particular departments that our people tend to be leaving from.
In [10]:
dept_table = pd.crosstab(hr_data['department'], hr_data['left'])
dept_table.index.names = ['Department']
dept_table
Out[10]:
We can check the above in terms of percentages to more easily see if there are particular departments that tend to have a higher proportion of people leaving.
In [11]:
dept_table_percentages = dept_table.apply(lambda row: (row/row.sum())*100, axis = 1)
dept_table_percentages
Out[11]:
R&D and management tend to have lower rates of leaving, and HR and accounting tend to have higher rates of leaving. The other departments are fairly similar, all between roughly 22 and 25 percent. We can also visualize the above data with a countplot.
In [12]:
sns.countplot(x='department', hue='left', data=hr_data)
Out[12]:
In [13]:
sns.boxplot(x='department', y='satisfaction_level', data=hr_data)
Out[13]:
While there doesn't appear to be too much of a difference in the satisfaction, we notice that both HR and accounting, the departments that have the highest rates of leaving, have slightly lower median satisfaction levels than the rest of the departments.
Salary is likely to have a high impact on leaving. In fact, it is quite plausible that both R&D and management, the two departments with the lowest leaving rates, have high salaries. Let's first check the relationship between leaving and salary.
In [14]:
sns.countplot(x='salary', hue='left', data=hr_data)
Out[14]:
Consistent with our hypothesis, the low salary group has the largest number of people leaving. Eyeballing the plot suggests that around 40% of those with low salaries leave, compared with around 25% of those with medium salaries and only about 10% of those with high salaries. These are rough visual estimates rather than computed rates.
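Since a countplot shows counts rather than rates, the percentages are easier to trust when computed directly. A minimal sketch, assuming a frame with `salary` and `left` columns like ours; the rows below are invented, so the printed rates are not the dataset's.

```python
import pandas as pd

# Invented toy data: 10 employees per salary level.
toy = pd.DataFrame({
    'salary': ['low'] * 10 + ['medium'] * 10 + ['high'] * 10,
    'left': [1] * 3 + [0] * 7 + [1] * 2 + [0] * 8 + [1] * 1 + [0] * 9,
})

# normalize='index' divides each row by its row total, giving per-level shares.
rates = pd.crosstab(toy['salary'], toy['left'], normalize='index') * 100

# Column 1 holds the share of each salary level that left.
leave_pct = rates[1].round(1)
print(leave_pct)
```

The same call on `hr_data` would give the exact leave rate per salary level, which can then be compared against the visual estimate.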
Let's also check the spread of satisfaction level between the different salary ranges.
In [15]:
sns.boxplot(x='salary', y='satisfaction_level', data=hr_data)
Out[15]:
Again, in line with some of our prior observations, low salary has the lowest median satisfaction and the highest spread.
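The median and spread read off a boxplot can also be tabulated. The sketch below uses invented satisfaction values and a small hypothetical helper, `iqr`, to compute the interquartile range (the height of each box) per salary level.

```python
import pandas as pd

# Invented values, chosen so the 'low' group is both lower and more spread out.
toy = pd.DataFrame({
    'salary': ['low'] * 5 + ['high'] * 5,
    'satisfaction_level': [0.1, 0.3, 0.5, 0.7, 0.9,
                           0.5, 0.6, 0.7, 0.8, 0.9],
})

def iqr(values):
    """Interquartile range: 75th percentile minus 25th percentile."""
    return values.quantile(0.75) - values.quantile(0.25)

# One row per salary level, with the median and the IQR as columns.
summary = toy.groupby('salary')['satisfaction_level'].agg(['median', iqr])
print(summary)
```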
Something that may impact employee perception in the company is the number of projects they are assigned.
In [16]:
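# factorplot was renamed to catplot in seaborn 0.9; kind='point' keeps the old default plot type
sns.catplot(x='number_project', y='last_evaluation', hue='department', data=hr_data, kind='point')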
Out[16]:
Evaluation scores clearly vary with the number of projects assigned to the employee. What's more, we again notice a peculiar trend in accounting -- they have a lower last_evaluation score than the other departments at 7 projects.
In [17]:
sns.boxplot(x='number_project', y='satisfaction_level', data=hr_data_new)
Out[17]:
It looks like we've found a very important relationship -- those with high numbers of projects (6 or 7) tend to have extremely low satisfaction levels. This will likely play a role when we do our modeling. Also worth noting is that those with only 2 projects tend to also have lower satisfaction levels.
Let's take a look at time spent at the company and the effect of that on leaving. It was the third most correlated factor with leaving, so this should give us some usable information. We also check this in the context of two of the variables we previously studied, salary and department, to see if there are additional insights we can extract.
In [18]:
timeplot = sns.catplot(x='time_spend_company', hue='left', y='department', row='salary', data=hr_data, aspect=2, kind='point')
There is a clear trend for those with low and medium salaries -- those who leave tend to have spent more time at the company. For those with high salaries, the pattern depends on the department: time spent barely differs in accounting between those who left and those who stayed, but it varies widely in the support and IT departments.
Before we move on to the modeling section, let's take a look at accidents. This was the second most correlated factor with leaving, interestingly enough.
In [19]:
accidentplot = plt.figure(figsize=(10,6))
accidentplotax = accidentplot.add_axes([0,0,1,1])
# jitter is not a violinplot parameter, so it is dropped; draw on the axes created above
sns.violinplot(x='department', y='average_montly_hours', hue='Work_accident', split=True, data=hr_data, ax=accidentplotax)
The difference is quite subtle, but the monthly hours (the column name really is misspelled as average_montly_hours in the dataset) seem to be bimodally distributed more often for those without work accidents than for those with.
Let's check a similar plot to see the relationship between leaving, work accidents, and satisfaction level.
In [20]:
satisaccident = plt.figure(figsize=(10,6))
satisaccidentax = satisaccident.add_axes([0,0,1,1])
sns.violinplot(x='left', hue='Work_accident', y='satisfaction_level', split=True, data=hr_data, ax=satisaccidentax)
What we see here is a marked difference in the spread of satisfaction levels between those who leave and those who don't. The peaks for those who left are slightly more pronounced among employees who have not had workplace accidents.
In [21]:
# We now use model_selection instead of cross_validation
from sklearn.model_selection import train_test_split
X = hr_data_new.drop('left', axis=1)
y = hr_data_new['left']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=47)
In [22]:
from sklearn.tree import DecisionTreeClassifier
dt = DecisionTreeClassifier()
dt.fit(X_train, y_train)
Out[22]:
While we will, of course, make predictions on our test set, we treat that as a holdout set and first do some cross-validation on our training set.
In [23]:
from sklearn.model_selection import cross_val_score
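# Score first on our training data (optimistic for an unpruned tree, which can memorize it)
print('Score: ', dt.score(X_train, y_train))
# Run 10-fold cross-validation once and reuse the scores, so the mean matches the folds printed
cv_scores = cross_val_score(dt, X_train, y_train, cv=10)
print('Cross validation score, 10-fold cv: \n', cv_scores)
print('Mean cross validation score: ', cv_scores.mean())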
Our results are very good, showing a consistently high score across all folds of our 10-fold cross-validation on the training data. Let's make predictions on the holdout set, check the model's score there, and repeat the cross-validation, this time on the full dataset.
In [24]:
predictions = dt.predict(X_test)
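print('Score: ', dt.score(X_test, y_test))
# Cross-validate on the full dataset; cross_val_score fits clones, so the fitted dt is left untouched
full_cv_scores = cross_val_score(dt, X, y, cv=10)
print('Cross validation score, 10-fold cv: \n', full_cv_scores)
print('Mean cross validation score: ', full_cv_scores.mean())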
Once again, the model performs very well. Let's check on some additional classification metrics to see, in more detail, how our model does.
In [25]:
from sklearn.metrics import confusion_matrix, classification_report
print('Confusion matrix: \n', confusion_matrix(y_test, predictions), '\n')
print('Classification report: \n', classification_report(y_test, predictions))
On the basis of our 4500 test samples, our model is very accurate, with only 98 test cases wrong (or only around 2.2% wrong). All of the other metrics -- precision, recall, f1-score -- are also very good. Our model also doesn't appear to display any inherent bias in predicting one class.
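The error count and error rate come straight from the off-diagonal cells of the confusion matrix. Here is a small sketch on invented labels that shows the arithmetic, followed by a check of the 98-out-of-4500 figure quoted above.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Invented labels: 10 cases, 2 of them misclassified.
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
y_pred = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 0])

cm = confusion_matrix(y_true, y_pred)

# For binary labels, ravel() returns the cells in the order tn, fp, fn, tp.
tn, fp, fn, tp = cm.ravel()

errors = fp + fn                 # off-diagonal cells are the mistakes
error_rate = errors / cm.sum()   # share of all test cases that were wrong
print('Errors:', errors, 'Error rate:', error_rate)

# The figure from the text: 98 wrong out of 4500 test samples.
reported_pct = round(98 / 4500 * 100, 1)
print('Reported error rate (%):', reported_pct)
```

Splitting the errors into `fp` and `fn` is also a quick way to check whether the mistakes lean toward one class.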
We can also take a look at the ROC curve to assess how well the model separates those who stay from those who leave.
In [26]:
from sklearn.metrics import roc_curve, roc_auc_score
probabilities = dt.predict_proba(X_test)
fpr, tpr, thresholds = roc_curve(y_test, probabilities[:,1])
rates = pd.DataFrame({'False Positive Rate': fpr, 'True Positive Rate': tpr})
roc = plt.figure(figsize = (10,6))
rocax = roc.add_axes([0,0,1,1])
rocax.plot(fpr, tpr, color='g', label='Decision Tree')
rocax.plot([0,1],[0,1], color='gray', ls='--', label='Baseline (Random Guessing)')
rocax.set_xlabel('False Positive Rate')
rocax.set_ylabel('True Positive Rate')
rocax.set_title('ROC Curve')
rocax.legend()
print('Area Under the Curve:', roc_auc_score(y_test, probabilities[:,1]))
With a very high area under the curve of 0.977, our model is excellent at discriminating between those who stay and those who leave.
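One way to read the area under the curve is as the probability that a randomly chosen leaver receives a higher predicted score than a randomly chosen stayer. The sketch below checks that interpretation on four invented scores by counting pairs by hand and comparing against `roc_auc_score`.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Invented labels and predicted scores for four cases.
y_true = np.array([0, 0, 1, 1])
scores = np.array([0.1, 0.4, 0.35, 0.8])

auc = roc_auc_score(y_true, scores)

# Compare every positive score with every negative score.
# A correctly ordered pair counts 1, a tie counts 0.5, a reversed pair counts 0.
positives = scores[y_true == 1]
negatives = scores[y_true == 0]
pairwise = np.mean([
    float(p > q) + 0.5 * float(p == q)
    for p in positives
    for q in negatives
])

print('AUC from sklearn:', auc)
print('AUC from pair counting:', pairwise)
```

Three of the four positive-negative pairs are ordered correctly here, so both numbers come out to 0.75.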
Let's check out the most important features, or those that are most influential in determining whether an employee leaves (or stays) in our company.
In [27]:
importances = dt.feature_importances_
print("Feature importances: \n")
for name, importance in zip(X.columns, importances):
    print('•', name, ":", importance)
To make it easier to interpret, we can order these from most important to least.
In [28]:
featureswithimportances = list(zip(X.columns, importances))
featureswithimportances.sort(key = lambda f: f[1], reverse=True)
print('Ordered feature importances: \n', '(From most important to least important)\n')
for rank, (name, importance) in enumerate(featureswithimportances, start=1):
    print(rank, ". ", name, ": ", importance)
In [60]:
sorted_features, sorted_importances = zip(*featureswithimportances)
plt.figure(figsize=(12,6))
sns.barplot(x=list(sorted_features), y=list(sorted_importances))
plt.title('Feature Importances (Gini Importance)')
plt.ylabel('Decrease in Node Impurity')
plt.xlabel('Feature')
plt.xticks(rotation=90);
Most of the variables we studied earlier, including satisfaction_level, last_evaluation, time_spend_company, number_project, and average_montly_hours, appear to be important. By studying how these variables differ between those who have left and those who haven't, we can more accurately determine who's leaving and why.
Interestingly, both salary and department appeared to have a relatively small effect in our decision tree model. This could be caused by the fact that the preceding 5 factors already more accurately describe the conditions of the person who will leave, regardless of department, or the fact that the preceding 5 factors are already strongly correlated enough with salary and/or department to reduce their importance in our final model.
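Gini importances are computed from the training data and can be split between correlated features, so a complementary check is permutation importance on held-out data. This is a different technique from the one used in this notebook, shown here only as a sketch on synthetic data: the `signal` and `noise` features and the rule generating the labels are invented. Permutation importance is itself affected by correlated features, so it is a second opinion, not a definitive answer.

```python
import numpy as np
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
n = 600

# Synthetic features: 'signal' fully determines the label, 'noise' is unrelated.
signal = rng.normal(size=n)
noise = rng.normal(size=n)
X_toy = np.column_stack([signal, noise])
y_toy = (signal > 0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.3, random_state=0)
tree = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)

# Shuffle one column at a time on the held-out set and measure the drop in accuracy.
result = permutation_importance(tree, X_te, y_te, n_repeats=10, random_state=0)

for name, mean_drop in zip(['signal', 'noise'], result.importances_mean):
    print(name, round(mean_drop, 3))
```

Applied to our model, a feature whose permutation importance stays near zero on the test set would support the reading that it adds little beyond the other factors.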
It's definitely worth taking our employees' satisfaction levels more seriously. We've discovered that satisfaction is related to, among other things, salary and the number of projects assigned. Further study could identify a combination of salary, number of projects, and other important factors that leads to better performance and profits for us and a lower employee attrition rate. Time spent at the company and employee evaluations also have an important effect on whether employees leave. This could ultimately be connected to their work, so it's worth investigating in more detail how departments delegate projects to their employees and what kinds of projects they're given, especially since HR and accounting have higher leave rates than the other functions.